Conversation

@motjuste
Contributor

@motjuste motjuste commented Mar 14, 2025

Description

dss now supports running on Canonical Kubernetes instead of microk8s. This support is currently available in channels for version 1.1. This PR adds support for testing dss on Canonical Kubernetes.

Updates to the provider

  • The install-deps script now accepts an argument to install Canonical Kubernetes instead of microk8s. It still installs microk8s by default.
  • install-deps now also installs the helm snap, which is used to enable NVIDIA GPU support in both Kubernetes variants.
  • There's now a single k8s_gpu_setup.py that can be used to set up GPUs from both Intel and NVIDIA on both microk8s and Canonical Kubernetes (a rough sketch of the unified flow follows this list).
    • Enabling Intel GPUs remains largely the same, and is arguably simpler now that kubectl apply with -k is used directly. Support for setting a specific number of slots-per-GPU has been removed, as it is not relevant to testing DSS at the moment.
    • helm is used to enable NVIDIA GPU support in the Kubernetes cluster, roughly following this guide.
    • The script detects microk8s and automatically applies the relevant containerd customisation.
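
As a rough illustration of the unified flow (the helper names, arguments, and defaults below are assumptions for illustration, not the actual contents of k8s_gpu_setup.py):

# Illustrative sketch only; helper names, arguments, and defaults are assumed.
import shutil
import subprocess


def using_microk8s() -> bool:
    # Crude probe: if the microk8s snap is absent, so is its binary.
    # (The commits below describe a more careful probe via `microk8s status`.)
    return shutil.which("microk8s") is not None


def enable_intel_gpu(kustomization: str) -> None:
    # Intel GPU plugin manifests are applied directly with `kubectl apply -k`.
    subprocess.run(["kubectl", "apply", "-k", kustomization], check=True)


def enable_nvidia_gpu(on_microk8s: bool) -> None:
    # helm-based enablement; sketched alongside the commit notes further down.
    raise NotImplementedError


def setup_gpu(vendor: str, kustomization: str) -> None:
    if vendor == "intel":
        enable_intel_gpu(kustomization)
    elif vendor == "nvidia":
        enable_nvidia_gpu(on_microk8s=using_microk8s())
    else:
        raise ValueError(f"unsupported GPU vendor: {vendor}")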

The provider snap's minor version has been bumped to mark the release from which Canonical Kubernetes is supported.

Updates to the GitHub Workflows

  • Added a reusable workflow checkbox-dss-build.yaml to build the checkbox-dss snap.
  • The existing workflow to run the tests in Testflinger has been updated to:
    • Use the snap-building workflow to build the snap only once and pass it as an attachment in the Testflinger jobs created by the matrix.
    • Get rid of an explicit matrix of different snap channels, and accept them as inputs for workflow dispatch instead. (However, the matrix of different Testflinger queues is kept in place.)

Resolved issues

Documentation

Updated the README for this provider. No changes to main Checkbox documentation.

Tests

One of the machines has been failing to provision today, but the tests have passed on the other two machines using the updated workflow.

motjuste added 11 commits March 14, 2025 15:33
We will need helm when installing Canonical k8s to enable the NVIDIA GPU
operator in it.

Canonical k8s (and helm) will only be installed if the explicit argument
for the channel to use is provided.  Otherwise, the old default
behaviour of installing microk8s is maintained.
We use helm to add the relevant chart repository from NVIDIA and install the chart.
We re-use the existing script to verify the rollout too.
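
A minimal sketch of that helm flow (repository URL, release name, namespace, and the daemonset checked at the end are assumptions, not copied from the script):

# Sketch only: names and URLs below are assumptions.
import subprocess


def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)


run("helm", "repo", "add", "nvidia", "https://helm.ngc.nvidia.com/nvidia")
run("helm", "repo", "update")
run("helm", "install", "gpu-operator", "nvidia/gpu-operator",
    "--namespace", "gpu-operator", "--create-namespace")

# Re-use the existing verification step: wait for the operator's validator
# daemonset to roll out (the daemonset name here is an assumption).
run("kubectl", "-n", "gpu-operator", "rollout", "status",
    "daemonset/nvidia-operator-validator", "--timeout=10m")
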
This job needs to run after either one of the two jobs above it enabling
NVIDIA GPU in the k8s cluster succeeds (one is for microk8s, the other is
for Canonical k8s).  We can't list those jobs in 'depends' because then
both of them would have to succeed, which is impossible since only one
of either microk8s or Canonical k8s will be available.

The trick we use here is that we now `depends` on `dss/initialize`,
which must succeed for the whole test-plan to be run anyway, and
we require that an NVIDIA GPU is present.  This is similar to the `depends`
of the two jobs for microk8s and Canonical k8s.  We then have to
be careful that this job is added in the test-plan to ONLY run after
those two jobs.  The difference is that this job will not be
skipped if either of the two jobs enabling NVIDIA GPU fails.
We now have an addition to the `install-deps` script. It also marks the
point from which we started supporting Canonical K8s.
The "worker" daemonset that was being verified may have a version number
in its name, which we cannot predict.
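
In other words, the verification has to match the daemonset by a name prefix rather than an exact name; a rough sketch of such a lookup (namespace and prefix are illustrative):

# Sketch: find a daemonset whose name starts with a known prefix, since the
# full name may carry a version suffix we cannot predict.
import subprocess


def find_daemonset(namespace: str, prefix: str) -> str:
    out = subprocess.run(
        ["kubectl", "-n", namespace, "get", "daemonsets",
         "-o", "jsonpath={.items[*].metadata.name}"],
        check=True, capture_output=True, text=True,
    ).stdout
    for name in out.split():
        if name.startswith(prefix):
            return name
    raise RuntimeError(f"no daemonset starting with {prefix!r} in {namespace}")
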
@codecov

codecov bot commented Mar 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 50.67%. Comparing base (d600c6b) to head (3b033ed).
⚠️ Report is 134 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1793      +/-   ##
==========================================
+ Coverage   50.44%   50.67%   +0.23%     
==========================================
  Files         382      384       +2     
  Lines       41026    41219     +193     
  Branches     6890     6890              
==========================================
+ Hits        20696    20889     +193     
  Misses      19585    19585              
  Partials      745      745              
Flag           Coverage Δ
provider-dss   100.00% <100.00%> (ø)

The NVIDIA GPU operator can be enabled in both microk8s and
Canonical k8s using helm, so we remove all the ad-hoc handling that
tried to distinguish whether microk8s or Canonical k8s was installed,
and just use the unified helm-based approach.

Helm now becomes a hard requirement.
@fernando79513
Collaborator

As discussed in:

@motjuste
Contributor Author

As discussed in:

@fernando79513 ... I was able to compress setting up K8s for NVIDIA and Intel GPUs, and moved them into a Python script (see this commit) ... Is this what you were expecting?

Personally, now that the setup is nicely compressed, I don't see too much value in wrapping it in a Python script.

Let me know if you still prefer this Python script, and what sort of unit-tests you believe it requires.

@motjuste motjuste marked this pull request as draft April 30, 2025 13:27
motjuste added 11 commits April 30, 2025 19:20
The labels take some time to propagate
If there's a TimeoutError, `microk8s status` was still executing, so it
is there ... even though it does not tell us whether microk8s is in use
or not.  Anyway, re-raise the error instead of deciding that there is no
microk8s.
The validator container may not have been created even after the
daemonset is rolled out (for some unknown reason), hence we wait before
checking the logs.  And, since checking the logs will wait for the
validations to succeed, this job may take considerably longer.
It is FileNotFoundError that is raised when microk8s is not installed,
so we are not going to try to catch all other CalledProcessErrors, for now.
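
A rough sketch of that detection logic (the timeout value is illustrative, and subprocess.TimeoutExpired stands in for the TimeoutError mentioned above; the real script may use a different timeout mechanism):

# Sketch of the microk8s probe described in the commit messages above.
import subprocess


def microk8s_is_installed() -> bool:
    try:
        subprocess.run(
            ["microk8s", "status", "--wait-ready"],
            check=True, capture_output=True, timeout=60,
        )
    except FileNotFoundError:
        # The microk8s binary is not on PATH, so the snap is not installed.
        return False
    except subprocess.TimeoutExpired:
        # `microk8s status` was still executing, so microk8s is there, even
        # though we cannot tell whether it is healthy; re-raise rather than
        # concluding that microk8s is absent.
        raise
    # Other CalledProcessErrors are deliberately not caught here.
    return True
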
we need snapd > 2.59 for SNAP_UID
@motjuste
Contributor Author

motjuste commented May 5, 2025

@fernando79513 ...

Sorry for the delay, but I got stuck in some weird behaviour of Helm-installing the Nvidia GPU operator (see my comment in the script).

Furthermore, I needed to add some special handling for installing the Nvidia operator on microk8s because it has a different setup for containerd. The script now automatically detects if microk8s is running, and appropriately configures the operator.
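
For reference, the containerd overrides that microk8s typically needs look roughly like this (the paths follow common microk8s guidance for the GPU operator and are an assumption about what the script passes, not a copy of it):

# Sketch: extra helm values pointing the NVIDIA container toolkit at
# microk8s' snap-confined containerd; paths are assumed, not taken from
# the actual script.
MICROK8S_TOOLKIT_ENV = [
    ("CONTAINERD_CONFIG", "/var/snap/microk8s/current/args/containerd-template.toml"),
    ("CONTAINERD_SOCKET", "/var/snap/microk8s/common/run/containerd.sock"),
    ("CONTAINERD_RUNTIME_CLASS", "nvidia"),
]


def microk8s_helm_overrides() -> list:
    # Rendered as repeated `--set toolkit.env[N].name/value=...` arguments
    # that get appended to the `helm install` command for the GPU operator.
    args = []
    for index, (name, value) in enumerate(MICROK8S_TOOLKIT_ENV):
        args += ["--set", f"toolkit.env[{index}].name={name}"]
        args += ["--set", f"toolkit.env[{index}].value={value}"]
    return args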

@motjuste motjuste marked this pull request as ready for review May 5, 2025 10:40
@motjuste
Contributor Author

motjuste commented May 22, 2025

Closing this without merging. To be picked up again as part of CHECKBOX-1898.

@motjuste motjuste closed this May 22, 2025
@motjuste motjuste deleted the CHECKBOX-1781-add-canonical-k8s branch August 21, 2025 11:11